22 research outputs found

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Author: Brown C. Titus
Canino-Koning Rosangela
Howe Adina Chuang
Pell Jason
Zhang Qingpeng
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer
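To make the trade-off described in this abstract concrete, here is a minimal sketch of a Count-Min Sketch applied to k-mer counting. It is an illustration only, not khmer's implementation or API: the class name `CountMinSketch`, the helpers `kmers` and `count_kmers`, the table dimensions, and the choice of salted BLAKE2b hashing are all assumptions made for this example. It shows the two properties the abstract names: only counters are stored (never the k-mers themselves), and a query can overcount but never undercount.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch; sizes and hashing are assumptions."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function; no keys are stored.
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # Derive one column per row by salting the hash with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        # Online update: increment one counter in every row.
        for row, col in self._indexes(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum across rows
        # is an upper bound on the true count that is never too low.
        return min(self.tables[row][col] for row, col in self._indexes(item))


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k, width=1000, depth=4):
    """Stream reads into a sketch, one k-mer at a time."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch
```

Calling `count_kmers(["ACGTACGT"], 4).count("ACGT")` returns a value of at least 2, the true count; a larger `width` lowers the chance of overcounting at the cost of more memory.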

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Assembling large, complex environmental metagenomes

Author: Brown C. Titus
Howe Adina Chuang
Jansson Janet
Malfatti Stephanie A.
Tiedje James M.
Tringe Susannah G.
Publication date: 12/12/2012

The large volumes of sequencing data required to sample complex environments deeply pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two pre-assembly filtering approaches, digital normalization and partitioning, to make large metagenome assemblies more computationally tractable. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license. Comment: Includes supporting informatio

arXiv.org e-Print Archive

eScholarship - University of California

Assembling large, complex environmental metagenomes

Author: Howe Adina Chuang
Publication date: 15/05/2017

Ezid

Microbial linkages to soil biogeochemical processes in a poorly drained agricultural ecosystem

Author: Chuang Howe Adina
Hall Steven
Howe Adina
Lawrence Nathaniel
Smith Schuyler
Sooksa-Nguan Thanwalee
Tenesaca Carlos
Yu Wenjuan
Publication date: 26/03/2021

Soil microorganisms mediate biogeochemical processes, but how microbial community composition influences these processes remains contested. We combined monthly sequencing of soil 16S rRNA genes and intensive measurements of nitrogen (N), carbon (C), and iron (Fe) cycling along a topographic gradient in a poorly drained intensive agricultural ecosystem (corn–soybean rotation) in the midwestern United States. Observed microbial composition changed little over time within and among years despite large differences in weather and crop type. Yet, microbial composition varied greatly with topographic location and correlated strongly with moisture, soil organic carbon (SOC), and especially pH. Microbial families, genera, and/or amplicon sequence variants often correlated significantly with measured biogeochemical processes or pools, yet different taxa within the same phylogenetic groups often responded in opposite ways, indicating a lack of ecological coherence among close relatives. Dominant phyla were generally similar across the topographic gradient but specific members showed consistent tradeoffs among locations. Ammonia oxidizing archaea and bacteria sequences varied oppositely with pH across the gradient, but their combined relative abundances remained similar, as did potential nitrification rates. Nitrospira sequences correlated positively with nitrous oxide (N2O) fluxes, suggesting a direct or indirect contribution of nitrification (or possibly comammox) to N2O production. We also found significant linkages between taxonomic groups and redox-sensitive Fe pools, indicating a role for redox variation in structuring microbial communities. Several globally dominant bacteria identified previously correlated significantly with measured biogeochemical variables, providing insights into their possible functional roles. 
Overall, microbial composition provided a coarse measure of several key biogeochemical functions and implicated taxa that possibly mediate these processes in a widespread agroecosystem of North America. This is a manuscript of an article published as Yu, Wenjuan, Nathaniel C. Lawrence, Thanwalee Sooksa-nguan, Schuyler D. Smith, Carlos Tenesaca, Adina Chuang Howe, and Steven J. Hall. "Microbial linkages to soil biogeochemical processes in a poorly drained agricultural ecosystem." Soil Biology and Biochemistry (2021): 108228. doi:10.1016/j.soilbio.2021.108228. Posted with permission.

Digital Repository @ Iowa State University (ISU)

Microbial linkages to soil biogeochemical processes in a poorly drained agricultural ecosystem

Author: Chuang Howe Adina
Hall Steven
Howe Adina
Lawrence Nathaniel
Smith Schuyler
Sooksa-Nguan Thanwalee
Tenesaca Carlos
Yu Wenjuan
Publication venue: Iowa State University Digital Repository
Publication date: 26/03/2021

Digital Repository @ Iowa State University (ISU)

Assembling large, complex environmental metagenomes

Author: Brown C Titus
Howe Adina Chuang
Jansson Janet
Malfatti Stephanie A
Tiedje James M
Tringe Susannah G
Publication venue: eScholarship, University of California
Publication date: 12/12/2012

eScholarship - University of California

Tackling soil diversity with the assembly of large, complex metagenomes

Author: Brown C Titus
Howe Adina Chuang
Jansson Janet K
Malfatti Stephanie A
Tiedje James M
Tringe Susannah G
Publication venue: eScholarship, University of California
Publication date: 01/04/2014

The large volumes of sequencing data required to sample deeply the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches--digital normalization and partitioning--to generate previously intractable large metagenome assemblies. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil

eScholarship - University of California

Tackling soil diversity with the assembly of large, complex metagenomes

Author: Adina Chuang Howe
C. Titus Brown
James M. Tiedje
Janet K. Jansson
Stephanie A. Malfatti
Susannah G. Tringe
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 14/03/2014

The large volumes of sequencing data required to sample deeply the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches—digital normalization and partitioning—to generate previously intractable large metagenome assemblies. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil

Crossref

PubMed Central

eScholarship - University of California

Iterative low-memory k-mer trimming.

Author: Qingpeng Zhang (274356)
Jason Pell (99648)
Rosangela Canino-Koning (604430)
Adina Chuang Howe (604431)
C. Titus Brown (98658)
Publication date: 25/07/2014

The results of trimming reads at unique (erroneous) k-mers from a 5 m read E. coli data set (1.4 GB) in under 30 MB of RAM. After each iteration, we measured the total number of distinct k-mers in the data set, the total number of unique (and likely erroneous) k-mers remaining, and the number of unique k-mers present at the 3' end of reads.
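As a rough illustration of what one trimming pass of this kind might look like, the sketch below truncates each read just before the end of its first unique k-mer. It is not the authors' code: the function names are hypothetical, the cut position is an assumption, and an exact `collections.Counter` stands in for the low-memory probabilistic counter that the caption implies.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def trim_at_unique(reads, k):
    """One trimming pass: cut each read at its first k-mer seen only once.

    Assumption: an exact Counter is used here for clarity; a low-memory
    run would replace it with an approximate counting structure.
    """
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    trimmed = []
    for read in reads:
        cut = len(read)
        for i, kmer in enumerate(kmers(read, k)):
            if counts[kmer] == 1:
                # Drop the last base of the unique k-mer and everything after,
                # so that k-mer no longer occurs in the trimmed read.
                cut = i + k - 1
                break
        trimmed.append(read[:cut])
    return trimmed
```

With `reads = ["ACGTAC", "ACGTAC", "ACGTAG"]` and `k = 4`, only the third read contains a unique k-mer (`GTAG`), so it is shortened to `ACGTA` while the other two are left unchanged. Repeating the pass on its own output until nothing changes gives an iterative version.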

FigShare

Archivo Digital UPM